A Classifier for Schema Types Generated by Web Data Extraction Systems
Abstract
Generating a Web site schema is a core step for value-added services on the web such as comparative shopping and information integration systems. Several approaches have been developed to detect this schema. For a real web site, the complexity of the site schema makes post-processing a challenge: labeling the schema types, comparing different schema types, and generating an extractor for instances of a schema type. In this paper, a new tree structure called the schema-type semantic model is proposed as a classifier for a schema type. Given some instances of a schema type, the HTML tag contents, DOM tree structural information, and visual information of these instances are exploited to construct the classifier. Using a multivariate normal distribution, the classifier can compare two different schema types; i.e., it can be used for schema mapping, which is a core step of information integration. The suggested classifier can also detect and extract instances of a schema type; i.e., it can serve as an extractor for web data extraction systems. Furthermore, the classifier can be used to improve the schema generated by web data extraction systems; i.e., to bring it as close as possible to a perfect schema. The experiments show an encourage
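As a rough illustration of how a multivariate normal distribution can score whether a candidate node belongs to a schema type, the sketch below fits a Gaussian to feature vectors of known instances and compares log-densities. It is not the paper's schema-type semantic model: the three features (text length, DOM depth, rendered width), the sample values, the ridge term, and the function names `fit_gaussian` and `log_density` are all assumptions made for this example.

```python
import numpy as np


def fit_gaussian(features, ridge=1e-6):
    """Estimate the mean vector and covariance matrix from rows of features.

    A small ridge term (an assumption of this sketch) keeps the covariance
    invertible when only a few instances are available.
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
    return mean, cov


def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    d = mean.shape[0]
    diff = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)


# Hypothetical instances of one schema type (e.g. a price field).
# Columns: [text length, DOM depth, rendered width in px] -- illustrative only.
price_instances = [
    [5, 6, 60],
    [6, 6, 64],
    [5, 7, 58],
    [7, 6, 66],
    [6, 7, 62],
]

mean, cov = fit_gaussian(price_instances)

# A candidate that resembles the known instances, and one that does not.
near = log_density([6, 6, 61], mean, cov)
far = log_density([120, 3, 400], mean, cov)

print("similar candidate scores higher:", near > far)
```

A higher log-density means the candidate looks more like the known instances; a threshold on this score, or a comparison between the fitted distributions of two schema types, would be one possible way to approach the extraction and schema-mapping uses the abstract mentions.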
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
A Semi-Supervised Framework Based on a Self-Constructed Adaptive Lexicon for Persian Opinion Analysis
With the appearance of Web 2.0 and 3.0, users’ contributions to the WWW have created a huge amount of valuable expressed opinions. Considering the difficulty or impossibility of manually analyzing such big data, sentiment analysis, as a branch of natural language processing, has received considerable attention. Unlike for other (popular) languages, a limited number of research studies have been conducted in ...
Web Data Mining Using FiVaTech
In this paper, we propose a new approach, called FiVaTech, for the problem of Web data extraction. FiVaTech is a page-level data extraction system which deduces the data schema and templates for the input pages generated from a CGI program. FiVaTech uses tree templates to model the generation of dynamic Web pages. FiVaTech can deduce the schema and templates for each individual Deep Web site, w...
EEG Based Brain Computer Interface Hand Grasp Control: Feature Extraction Method MTCSP
Brain-Computer Interfaces (BCIs) are communication systems that enable users to send commands to computers by using brain activity only; this activity is generally measured by Electroencephalography (EEG). BCIs are generally designed according to a pattern recognition approach, i.e., by extracting features from EEG signals, and by using a classifier to identify the user’s mental state from...
Publication date: 2014